Inducing Partially Observable Markov Decision Processes
Abstract
In the field of reinforcement learning (Sutton and Barto, 1998; Kaelbling et al., 1996), agents interact with an environment to learn how to act to maximize reward. Two different kinds of environment models dominate the literature: Markov Decision Processes (Puterman, 1994; Littman et al., 1995), or MDPs, and POMDPs, their partially observable counterpart (White, 1991; Kaelbling et al., 1998). Both consist of a Markovian state space in which state transitions and immediate rewards are influenced by the action choices of the agent. The difference between the two is that the state is directly observed by the agent in MDPs, whereas agents in POMDP environments are given only indirect access to the state via "observations". This small change to the definition of the model makes a huge difference to the difficulty of learning and planning. Whereas computing a plan that maximizes reward takes time polynomial in the size of the state space for MDPs (Papadimitriou and Tsitsiklis, 1987), determining the optimal first action to take in a POMDP is undecidable (Madani et al., 2003). The learning problem is not as well studied, but algorithms for learning to approximately optimize an MDP with a polynomial amount of experience have been created (Kearns and Singh, 2002; Strehl et al., 2009), whereas similar results for POMDPs remain elusive.

A key observation for learning to obtain near-optimal reward in an MDP is that inducing a highly accurate model of an MDP from experience can be a simple matter of counting observed transitions between states under the influence of the selected actions. The critical quantities are all directly observed, and simple statistics are enough to reveal their relationships. Learning in more complex MDPs is a matter of properly generalizing the observed experience to novel states (Atkeson et al., 1997) and can often be done provably efficiently (Li et al., 2011).

Inducing a POMDP, however, appears to involve a difficult "chicken-and-egg" problem. If a POMDP's structure is known, it is possible to keep track of the likelihood of occupying each Markovian state at each moment in time while selecting actions and making observations, thus enabling the POMDP's structure to be learned. But if the POMDP's structure is not known in advance, this information is not available, making it unclear how to collect the necessary statistics. Thus, in many ways, the POMDP induction problem has elements in common with grammatical induction: the hidden states, like non-terminals, are important for explaining the structure of observed sequences, but cannot be directly detected. Several different strategies have been used by researchers attempting to induce POMDP models in the context of reinforcement learning. The first work that explicitly introduced...
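To make the contrast above concrete, here is a minimal sketch (the function names and the toy model are assumptions introduced for illustration, not taken from the paper): estimating an MDP's transition model really can be a matter of counting observed (state, action, next state) triples, whereas tracking the likelihood of occupying each hidden state of a POMDP via the standard Bayesian belief update presupposes that the transition and observation probabilities are already known.

```python
import numpy as np

# (a) Inducing an MDP model: count observed (s, a, s') transitions.
def estimate_transition_model(transitions, n_states, n_actions):
    """Empirical transition probabilities from directly observed experience."""
    counts = np.zeros((n_states, n_actions, n_states))
    for s, a, s_next in transitions:
        counts[s, a, s_next] += 1.0
    totals = counts.sum(axis=2, keepdims=True)
    # Visited (s, a) pairs get their empirical frequencies;
    # unvisited pairs are left uniform as a placeholder.
    return np.where(totals > 0, counts / np.maximum(totals, 1.0), 1.0 / n_states)

# (b) Tracking a POMDP's hidden state: the Bayesian belief update.
# Note that it needs the true models T[a][s, s'] = P(s' | s, a) and
# O[a][s', o] = P(o | s', a) -- the very quantities induction must recover.
def belief_update(belief, action, observation, T, O):
    predicted = belief @ T[action]                    # predict the next hidden state
    weighted = predicted * O[action][:, observation]  # weight by observation likelihood
    return weighted / weighted.sum()                  # renormalize

# Toy two-state, one-action, two-observation POMDP (made up for illustration).
T = np.array([[[0.9, 0.1],
               [0.2, 0.8]]])
O = np.array([[[0.8, 0.2],
               [0.3, 0.7]]])
b = np.array([0.5, 0.5])
b = belief_update(b, action=0, observation=1, T=T, O=O)   # -> roughly [0.26, 0.74]
```

The asymmetry is the point of the sketch: the counting estimator touches only quantities the agent directly observes, while the belief update presupposes the model it would be used to learn, which is exactly the chicken-and-egg obstacle described above.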
Similar Resources
Identifying and exploiting weak-information inducing actions in solving POMDPs
We present a method for identifying actions that lead to observations which are only weakly informative in the context of partially observable Markov decision processes (POMDPs). We call such actions weak- (inclusive of zero-) information inducing. Policy subtrees rooted at these actions may be computed more efficiently. While zero-information inducing actions may be exploited without error, th...
A POMDP Framework to Find Optimal Inspection and Maintenance Policies via Availability and Profit Maximization for Manufacturing Systems
Maintenance can either increase or decrease a system's availability, so it is valuable to evaluate a maintenance policy from both the cost and availability points of view, simultaneously and according to the decision maker's priorities. This study proposes a Partially Observable Markov Decision Process (POMDP) framework for a partially observable and stochastically deteriorating syste...
MDPs, Semi-Markov Decision Processes, Hidden Markov Models, Partially Observable SMDPs, Hierarchical HMMs
Transition Entropy in Partially Observable Markov Decision Processes
This paper proposes a new heuristic algorithm suitable for real-time applications using partially observable Markov decision processes (POMDPs). The algorithm is based on a reward shaping strategy that includes entropy information in the reward structure of a fully observable Markov decision process (MDP). This strategy, as illustrated by the presented results, exhibits near-optimal performance...
Increasing Scalability in Algorithms for Centralized and Decentralized Partially Observable Markov Decision Processes: Efficient Decision-Making and Coordination in Uncertain Environments
Deciding the Value 1 Problem for ♯-acyclic Partially Observable Markov Decision Processes
The value 1 problem is a natural decision problem in algorithmic game theory. For partially observable Markov decision processes with a reachability objective, this problem is defined as follows: are there strategies that achieve the reachability objective with probability arbitrarily close to 1? This problem was recently shown to be undecidable. Our contribution is to introduce a class of partially ob...